---
layout: page
title: Bivariate Regression
permalink: /Rcoding/coding3/
parent: R Coding
nav_order: 3
---

Script 3: Bivariate Regression


Main Commands Used in this Session

cov() # to calculate the covariance
cor() # to calculate the correlation
dim() # shows the dimensions of a dataframe
lm() # to fit a linear regression model
plot() # to plot how observations of two variables relate to each other
subset() # to select a subgroup of observations
abline() # to fit a straight line into a plot
scatterplot() # from the car package; a plot that also shows non-linear fits
summary() # to check the variables of a dataset
boxplot() # to create a boxplot
hist() # to look at how a variable is distributed across observations as a histogram
densityPlot() # from the car package; to see a variable's distribution
stargazer() # to print publishable tables

In this lab, we will focus on working with observational data and predicting patterns in data. Until now you have been describing patterns.

After this lab you will be able to predict the value of a dependent variable (Y) from an independent variable (X). In other words, we test whether a relationship exists between the two variables.

Once we have established the relationship, we will be able to make estimates, that is, predict values of Y even for values of X that do not appear in our data. We will first go through what the regression equation really means and then apply it to a real dataset on the Brexit vote.


Regression: Ordinary Least Squares (OLS)

The simplest relationship between an independent and a dependent variable is a straight line. As a consequence, a regular linear regression equation is expressed as follows:

\[Y=\alpha+\beta X\]

\(\alpha\) is referred to as the constant, or intercept, and denotes the point where the regression line ‘intercepts’ the Y axis. In other words, it estimates the average value of Y when X equals zero.

\(\beta\) is the coefficient, or slope, and estimates the average change in Y associated with a unit change in X.

Using this function it is possible to predict Y if you know X.

For example, you may be familiar with the following equation: \(Y= 32 + 1.8X\), where the temperature in Fahrenheit is an exact linear function of the temperature in Celsius. This relationship is deterministic. In the social sciences, however, relationships between variables are almost always inexact: certain values may be more likely than others, but we cannot say for sure. Therefore, the regression equation is more realistically written as:

\[Y_i=\alpha+\beta X_i + \mu_i\] where \(\mu\) represents the error. This relationship is probabilistic, allowing variability in the values of Y for each value of X.
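The deterministic Celsius-Fahrenheit example can be written as a small R function (a sketch for illustration; the function name is ours):

```r
# A sketch of the deterministic relationship Y = 32 + 1.8X:
# temperature in Fahrenheit as an exact function of Celsius
celsius_to_fahrenheit <- function(celsius) {
  32 + 1.8 * celsius
}

celsius_to_fahrenheit(0)    # the intercept: 0 degrees C is 32 F
celsius_to_fahrenheit(100)  # 100 degrees C is 212 F
```

A probabilistic relationship, by contrast, would add an error term, so repeated observations at the same value of X would not all yield the same Y.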

To find the ‘best’ line - that is a line with the smallest errors - we need to minimize the sum of squared errors:

\[SSE = \sum(Y_i-\hat{Y}_i)^2\]

where \(Y_i\) is the actual value and where \(\hat{Y}_i\) is the predicted value for each unit. The total sum of the errors will include both positive and negative numbers, cancelling each other out. Squaring each error overcomes this problem of opposite signs. The technique that minimizes the sum of the squared errors for linear regression is also known as Ordinary Least Squares estimation (OLS).
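As a quick illustration with made-up numbers (all object names are ours), the SSE for any candidate line can be computed directly:

```r
# Toy data (made-up) roughly following Y = 1 + 2X plus noise
x <- c(1, 2, 3, 4, 5)
y <- c(3.1, 4.9, 7.2, 8.8, 11.1)

# Predicted values from a candidate line with intercept 1 and slope 2
y_hat <- 1 + 2 * x

# Sum of squared errors: squaring keeps positive and negative errors
# from cancelling each other out
sse <- sum((y - y_hat)^2)
sse
```

OLS searches over all possible intercepts and slopes for the pair that makes this quantity as small as possible.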

In R, the general formula for running a simple linear regression is

#lm(dependent variable ~ independent variable, data = ...)
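A minimal sketch with made-up data shows the pattern in use; lm() returns the estimated intercept (\(\alpha\)) and slope (\(\beta\)):

```r
# A toy illustration with made-up data (object names are ours)
toy <- data.frame(x = c(1, 2, 3, 4),
                  y = c(2.1, 3.9, 6.2, 7.8))

toy_fit <- lm(y ~ x, data = toy)
coef(toy_fit)  # the estimated intercept (alpha) and slope (beta)
```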

Working with Observational Data

Next, we apply the technique of OLS regressions to try to understand what the driving factors were behind Leave votes in the 2016 Brexit referendum. It is always a good idea to start your research by eyeballing the data and getting some basic information on it. But before going on, let’s set the working directory and clear our workspace.

#We can set the working directory 
#setwd("~/Desktop/DataAnalysis/script3") 

#clear our workspace
rm(list = ls())

This time, we are also going to install a package to help us plot results later on.

Packages include commands and functions that are not part of base R and have often been created by researchers to serve their particular needs. It's important to know how to install packages on your own.

#we download a package called "Companion to Applied Regression"
#it helps us to create some of the graphs we are going to use

#install.packages("car") #here we install the package
library(car) #next we load it, note that quotation marks matter!

Make sure you have downloaded the referendum data and moved it to the folder you set as working directory. Next, we load the dataset and check it using the View() function.

brexit_data <- read.csv("brexit_data.csv") # we use read csv since we are using a csv file
View(brexit_data) # we view the data in a new window

# If you can't remember your file name, or can't find the data
# you can always use the following command to manually locate a file
#brexit_data <- read.csv(file.choose())

To inspect the data further we can use summary().

#first we summarize our data
summary(brexit_data)
     Area           Region_Code           Region           Area_Code        
 Length:382         Length:382         Length:382         Length:382        
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
   Electorate      ExpectedBallots  VerifiedBallotPapers Percent_Turnout
 Min.   :   1799   Min.   :  1424   Min.   :  1424       Min.   :56.25  
 1st Qu.:  72524   1st Qu.: 54879   1st Qu.: 54878       1st Qu.:70.27  
 Median :  96426   Median : 72545   Median : 72544       Median :74.34  
 Mean   : 121728   Mean   : 87904   Mean   : 87901       Mean   :73.75  
 3rd Qu.: 141380   3rd Qu.:104437   3rd Qu.:104437       3rd Qu.:77.91  
 Max.   :1260955   Max.   :790647   Max.   :790523       Max.   :83.64  
                                                                        
   Votes_Cast      Valid_Votes         Remain           Leave       
 Min.   :  1424   Min.   :  1424   Min.   :   803   Min.   :   621  
 1st Qu.: 54876   1st Qu.: 54844   1st Qu.: 23535   1st Qu.: 28669  
 Median : 72545   Median : 72512   Median : 33475   Median : 37574  
 Mean   : 87899   Mean   : 87832   Mean   : 42255   Mean   : 45578  
 3rd Qu.:104437   3rd Qu.:104332   3rd Qu.: 48246   3rd Qu.: 54138  
 Max.   :790523   Max.   :790149   Max.   :440707   Max.   :349442  
                                                                    
 Percent_Remain  Percent_Leave    Rejected_Ballots No_official_mark 
 Min.   :24.44   Min.   : 4.085   Min.   :  0.00   Min.   : 0.0000  
 1st Qu.:39.65   1st Qu.:47.145   1st Qu.: 33.25   1st Qu.: 0.0000  
 Median :45.73   Median :54.266   Median : 46.50   Median : 0.0000  
 Mean   :47.01   Mean   :52.989   Mean   : 66.38   Mean   : 0.6073  
 3rd Qu.:52.86   3rd Qu.:60.347   3rd Qu.: 74.00   3rd Qu.: 0.0000  
 Max.   :95.91   Max.   :75.562   Max.   :614.00   Max.   :39.0000  
                                                                    
 Voting_for_both_answers Writing_or_mark  Unmarked_or_void Percent_Rejected
 Min.   :  0.00          Min.   : 0.000   Min.   :  0.00   Min.   : 0.000  
 1st Qu.: 10.00          1st Qu.: 0.000   1st Qu.: 21.00   1st Qu.: 6.000  
 Median : 16.00          Median : 1.000   Median : 30.00   Median : 7.000  
 Mean   : 23.78          Mean   : 2.188   Mean   : 39.81   Mean   : 7.304  
 3rd Qu.: 27.00          3rd Qu.: 3.000   3rd Qu.: 45.00   3rd Qu.: 8.000  
 Max.   :311.00          Max.   :35.000   Max.   :286.00   Max.   :24.000  
                                                                           
 Residents_total   Population_Density Economically_active_percent
 Min.   :   2203   Min.   :  0.20     Min.   :60.80              
 1st Qu.:  93888   1st Qu.:  2.00     1st Qu.:67.70              
 Median : 125499   Median :  5.60     Median :70.30              
 Mean   : 161331   Mean   : 15.88     Mean   :70.17              
 3rd Qu.: 199823   3rd Qu.: 21.50     3rd Qu.:72.70              
 Max.   :1073045   Max.   :138.70     Max.   :82.10              
 NA's   :38        NA's   :38         NA's   :7                  
 Employed_of_economically_active_percent Unemployed_Age50_74_percent
 Min.   :48.60                           Min.   :0.4000             
 1st Qu.:59.65                           1st Qu.:0.7000             
 Median :63.40                           Median :0.8000             
 Mean   :62.99                           Mean   :0.7909             
 3rd Qu.:66.35                           3rd Qu.:0.9000             
 Max.   :80.50                           Max.   :1.3000             
 NA's   :7                               NA's   :7                  
 Health_very_good Health_very_good_percent  Health_good     Health_good_percent
 Min.   :   832   Min.   :25.79            Min.   :   718   Min.   :25.70      
 1st Qu.: 29786   1st Qu.:30.04            1st Qu.: 30459   1st Qu.:30.25      
 Median : 40721   Median :31.89            Median : 39549   Median :31.17      
 Mean   : 52076   Mean   :32.30            Mean   : 49404   Mean   :31.02      
 3rd Qu.: 62939   3rd Qu.:34.22            3rd Qu.: 59641   3rd Qu.:32.14      
 Max.   :311451   Max.   :48.69            Max.   :306823   Max.   :35.03      
 NA's   :38       NA's   :38               NA's   :38       NA's   :38         
  Health_fair     Health_fair_percent   Health_bad    Health_bad_percent
 Min.   :   243   Min.   : 8.517      Min.   :   51   Min.   :2.033     
 1st Qu.: 12327   1st Qu.:11.433      1st Qu.: 3479   1st Qu.:3.280     
 Median : 16231   Median :12.961      Median : 4980   Median :3.919     
 Mean   : 20586   Mean   :12.847      Mean   : 6820   Mean   :4.103     
 3rd Qu.: 24444   3rd Qu.:14.250      3rd Qu.: 8381   3rd Qu.:4.731     
 Max.   :140751   Max.   :17.891      Max.   :52383   Max.   :8.207     
 NA's   :38       NA's   :38          NA's   :38      NA's   :38        
 Health_very_bad Health_very_bad_percent Single_never_married_percent
 Min.   :   13   Min.   :0.5427          Min.   :21.00               
 1st Qu.:  972   1st Qu.:0.9062          1st Qu.:28.00               
 Median : 1416   Median :1.1362          Median :31.00               
 Mean   : 1997   Mean   :1.1845          Mean   :32.53               
 3rd Qu.: 2463   3rd Qu.:1.3807          3rd Qu.:34.70               
 Max.   :16955   Max.   :2.7703          Max.   :59.90               
 NA's   :38      NA's   :38              NA's   :7                   
 Married_percent Occup_high_manage_admin_profess_percent
 Min.   :24.80   Min.   : 4.00                          
 1st Qu.:45.70   1st Qu.: 7.50                          
 Median :49.40   Median : 9.50                          
 Mean   :48.26   Mean   :10.27                          
 3rd Qu.:52.75   3rd Qu.:12.30                          
 Max.   :59.20   Max.   :35.50                          
 NA's   :7       NA's   :7                              
 Occup_low_manage_admin_profess_percent Occup_intermediate_percent
 Min.   :12.90                          Min.   : 7.2              
 1st Qu.:18.95                          1st Qu.:11.8              
 Median :21.20                          Median :12.9              
 Mean   :21.30                          Mean   :12.9              
 3rd Qu.:23.60                          3rd Qu.:14.1              
 Max.   :32.40                          Max.   :19.4              
 NA's   :7                              NA's   :7                 
 Occup_small_employer_percent Occup_low_supervis_technical_percent
 Min.   : 4.900               Min.   : 2.400                      
 1st Qu.: 8.000               1st Qu.: 6.350                      
 Median : 9.600               Median : 7.400                      
 Mean   : 9.925               Mean   : 7.285                      
 3rd Qu.:11.550               3rd Qu.: 8.300                      
 Max.   :27.400               Max.   :11.500                      
 NA's   :7                    NA's   :7                           
 Occup_semi_routine_percent Occup_routine_percent Earnings_Median
 Min.   : 4.10              Min.   : 2.50         Min.   :21461  
 1st Qu.:12.50              1st Qu.: 8.80         1st Qu.:25374  
 Median :14.70              Median :11.10         Median :27729  
 Mean   :14.38              Mean   :11.25         Mean   :28477  
 3rd Qu.:16.50              3rd Qu.:13.55         3rd Qu.:30987  
 Max.   :22.20              Max.   :22.40         Max.   :42141  
 NA's   :7                  NA's   :7             NA's   :20     
 Earnings_Mean   Bachelors_deg_percent    age_mean       age_median   
 Min.   :23711   Min.   :14.20         Min.   :30.90   Min.   :29.00  
 1st Qu.:29372   1st Qu.:21.80         1st Qu.:38.67   1st Qu.:38.00  
 Median :32554   Median :25.70         Median :40.20   Median :41.00  
 Mean   :34355   Mean   :26.92         Mean   :40.24   Mean   :40.45  
 3rd Qu.:37539   3rd Qu.:30.80         3rd Qu.:42.12   3rd Qu.:44.00  
 Max.   :64937   Max.   :68.40         Max.   :47.70   Max.   :51.00  
 NA's   :14      NA's   :7             NA's   :38      NA's   :38     
    Birth_UK      Birth_UK_percent Birth_other_EU  Birth_other_EU_percent
 Min.   :  2070   Min.   :44.92    Min.   :   54   Min.   : 0.6272       
 1st Qu.: 87229   1st Qu.:87.89    1st Qu.: 1873   1st Qu.: 1.7171       
 Median :115938   Median :92.20    Median : 3379   Median : 2.5365       
 Mean   :139594   Mean   :88.81    Mean   : 5887   Mean   : 3.3280       
 3rd Qu.:156499   3rd Qu.:95.10    3rd Qu.: 6040   3rd Qu.: 3.7116       
 Max.   :834732   Max.   :97.85    Max.   :40324   Max.   :16.7817       
 NA's   :38       NA's   :38       NA's   :38      NA's   :38            
  Birth_Africa     Birth_Africa_percent Birth_MidEast_Asia
 Min.   :   23.0   Min.   : 0.2306      Min.   :    13    
 1st Qu.:  584.5   1st Qu.: 0.5786      1st Qu.:  1126    
 Median : 1485.0   Median : 1.0160      Median :  2763    
 Mean   : 3806.0   Mean   : 1.8598      Mean   :  7502    
 3rd Qu.: 3150.0   3rd Qu.: 2.0072      3rd Qu.:  6648    
 Max.   :37059.0   Max.   :12.8551      Max.   :129948    
 NA's   :38        NA's   :38           NA's   :38        
 Birth_MidEast_Asia_percent Birth_Americas_Carrib Birth_Americas_Carrib_percent
 Min.   : 0.4809            Min.   :   13.0       Min.   : 0.06446             
 1st Qu.: 1.0476            1st Qu.:  370.2       1st Qu.: 0.33896             
 Median : 2.0138            Median :  742.5       Median : 0.55355             
 Mean   : 3.5531            Mean   : 1948.5       Mean   : 1.02443             
 3rd Qu.: 4.1979            3rd Qu.: 1495.5       3rd Qu.: 1.00056             
 Max.   :26.8274            Max.   :25320.0       Max.   :13.59878             
 NA's   :38                 NA's   :38            NA's   :38                   
 Birth_Antarctica_Oceania_Other Birth_Antarctica_Oceania_Other_percent
 Min.   :  19.0                 Min.   :0.04456                       
 1st Qu.: 144.8                 1st Qu.:0.12161                       
 Median : 265.5                 Median :0.19480                       
 Mean   : 531.4                 Mean   :0.31361                       
 3rd Qu.: 482.0                 3rd Qu.:0.31655                       
 Max.   :9314.0                 Max.   :3.84891                       
 NA's   :38                     NA's   :38                            

Each observation of the brexit_data.csv dataset contains information about how each of the 382 regional units (called “local authorities” in the UK) in England, Scotland and Wales voted, along with some key demographic information about the unit.

The key variables are:

| Variable Name | Variable Description |
|---|---|
| Area | The name of the electoral unit |
| Region | Name of the region a unit belongs to. Note: a region contains several units |
| Percent_Turnout | The percentage of people who voted |
| Percent_Remain | The percentage of people who voted Remain |
| Percent_Leave | The percentage of people who voted Leave |
| Health_good_percent | The percentage of people with good health in the area |
| Earnings_Median | The median earnings in the area |
| age_mean | The mean age in the area |
| Birth_UK_percent | The percentage of UK-born people in the area |

Note that these variables come in various measures: for example, age is coded as a mean and earnings as a median, while the Remain and Leave votes are percentages. Be careful about which variable you call, because it influences how you interpret the results!

Let’s explore the dataset in a few simple commands that we have learned in previous labs:

str(brexit_data) #display structure of the dataset
'data.frame':   382 obs. of  61 variables:
 $ Area                                   : chr  "Aberdeen City" "Aberdeenshire" "Adur" "Allerdale" ...
 $ Region_Code                            : chr  "S92000003" "S92000003" "E12000008" "E12000002" ...
 $ Region                                 : chr  "Scotland" "Scotland" "South East" "North West" ...
 $ Area_Code                              : chr  "S12000033" "S12000034" "E07000223" "E07000026" ...
 $ Electorate                             : int  154266 196809 48755 74426 96760 87137 66642 117138 91916 90516 ...
 $ ExpectedBallots                        : int  104809 139014 37253 54268 73870 59295 48734 91203 66948 69828 ...
 $ VerifiedBallotPapers                   : int  104809 139014 37251 54268 73868 59282 48734 91198 66947 69828 ...
 $ Percent_Turnout                        : num  67.9 70.6 76.4 72.9 76.3 ...
 $ Votes_Cast                             : int  104809 139014 37251 54268 73864 59282 48734 91199 66946 69827 ...
 $ Valid_Votes                            : int  104714 138961 37229 54238 73820 59258 48696 91129 66899 69786 ...
 $ Remain                                 : int  63985 76445 16914 22429 29319 32747 29494 34193 20179 28314 ...
 $ Leave                                  : int  40729 62516 20315 31809 44501 26511 19202 56936 46720 41472 ...
 $ Percent_Remain                         : num  61.1 55 45.4 41.4 39.7 ...
 $ Percent_Leave                          : num  38.9 45 54.6 58.6 60.3 ...
 $ Rejected_Ballots                       : int  95 53 22 30 44 24 38 70 47 41 ...
 $ No_official_mark                       : int  0 0 0 0 0 0 0 0 1 0 ...
 $ Voting_for_both_answers                : int  34 19 8 13 7 9 16 22 15 13 ...
 $ Writing_or_mark                        : int  2 1 0 0 1 1 1 2 1 0 ...
 $ Unmarked_or_void                       : int  59 33 14 17 36 14 21 46 30 28 ...
 $ Percent_Rejected                       : int  9 4 6 6 6 4 8 8 7 6 ...
 $ Residents_total                        : int  NA NA 61182 96422 122309 NA NA 149518 119497 117956 ...
 $ Population_Density                     : num  NA NA 14.6 0.8 4.6 NA NA 6.8 10.9 2 ...
 $ Economically_active_percent            : num  73.3 74.9 70.8 68.9 69.9 69.3 68.6 67.5 68.1 72.6 ...
 $ Employed_of_economically_active_percent: num  62.9 69.5 64.9 63.2 63.8 62.2 62.7 61.5 61.3 65.9 ...
 $ Unemployed_Age50_74_percent            : num  0.5 0.5 0.8 0.7 0.7 0.8 1 0.8 0.8 0.8 ...
 $ Health_very_good                       : int  NA NA 17796 29667 36279 NA NA 44201 32328 36643 ...
 $ Health_very_good_percent               : num  NA NA 29.1 30.8 29.7 ...
 $ Health_good                            : int  NA NA 20381 30614 39347 NA NA 50809 37078 37029 ...
 $ Health_good_percent                    : num  NA NA 33.3 31.8 32.2 ...
 $ Health_fair                            : int  NA NA 8948 13930 17945 NA NA 22823 18880 14453 ...
 $ Health_fair_percent                    : num  NA NA 14.6 14.4 14.7 ...
 $ Health_bad                             : int  NA NA 2704 4697 5745 NA NA 6510 6657 4076 ...
 $ Health_bad_percent                     : num  NA NA 4.42 4.87 4.7 ...
 $ Health_very_bad                        : int  NA NA 750 1247 1525 NA NA 1821 1755 1210 ...
 $ Health_very_bad_percent                : num  NA NA 1.23 1.29 1.25 ...
 $ Single_never_married_percent           : num  42.7 27.3 29.6 28 28.8 28.5 29.1 26.8 31.1 29 ...
 $ Married_percent                        : num  40.5 55.9 47.9 52 50.7 51.3 49.9 50 48.3 51.7 ...
 $ Occup_high_manage_admin_profess_percent: num  12 10.4 8.7 7.8 9.4 7.4 7.9 8.9 6 10.2 ...
 $ Occup_low_manage_admin_profess_percent : num  19.5 21.6 22.1 18.6 19.8 21.9 22.7 21.3 16.3 22 ...
 $ Occup_intermediate_percent             : num  12.1 12.1 15.4 10.2 11.7 12.2 12.6 13.8 12.3 13.4 ...
 $ Occup_small_employer_percent           : num  4.9 10 11.9 11.2 9.5 9 13 12.2 8.2 11.7 ...
 $ Occup_low_supervis_technical_percent   : num  8.4 10 7.6 9.8 9 9.5 8.4 7.6 9.6 7.9 ...
 $ Occup_semi_routine_percent             : num  13.5 15.2 15.7 18.3 15.9 15.9 15.3 16.3 17.8 14.5 ...
 $ Occup_routine_percent                  : num  11.7 12.4 9.6 15.1 15.4 13.5 11.5 10.6 19.1 10.3 ...
 $ Earnings_Median                        : int  29249 29918 24677 28785 27872 27470 25632 26137 24249 27795 ...
 $ Earnings_Mean                          : int  35083 38185 30472 34038 34676 29539 29070 32576 27851 33066 ...
 $ Bachelors_deg_percent                  : num  33.2 27 22 22.8 23.2 23.6 28.9 22.8 15.1 24.6 ...
 $ age_mean                               : num  NA NA 43 43.1 41.8 NA NA 45.5 40 39.6 ...
 $ age_median                             : int  NA NA 44 45 43 NA NA 47 41 40 ...
 $ Birth_UK                               : int  NA NA 57390 93845 118664 NA NA 136653 115917 106993 ...
 $ Birth_UK_percent                       : num  NA NA 93.8 97.3 97 ...
 $ Birth_other_EU                         : int  NA NA 1134 1061 1513 NA NA 6497 1694 3348 ...
 $ Birth_other_EU_percent                 : num  NA NA 1.85 1.1 1.24 ...
 $ Birth_Africa                           : int  NA NA 691 307 472 NA NA 1478 412 1680 ...
 $ Birth_Africa_percent                   : num  NA NA 1.129 0.318 0.386 ...
 $ Birth_MidEast_Asia                     : int  NA NA 976 551 721 NA NA 2217 792 3848 ...
 $ Birth_MidEast_Asia_percent             : num  NA NA 1.595 0.571 0.589 ...
 $ Birth_Americas_Carrib                  : int  NA NA 311 245 339 NA NA 909 237 700 ...
 $ Birth_Americas_Carrib_percent          : num  NA NA 0.508 0.254 0.277 ...
 $ Birth_Antarctica_Oceania_Other         : int  NA NA 112 124 131 NA NA 358 64 345 ...
 $ Birth_Antarctica_Oceania_Other_percent : num  NA NA 0.183 0.129 0.107 ...
dim(brexit_data) #the number of observations and variables in the data
[1] 382  61
hist(brexit_data$Percent_Turnout) #how do you interpret the histogram?

hist(brexit_data$Percent_Leave)

hist(brexit_data$age_mean)

How do you interpret these histograms?

Let’s have a closer look at the Leave-vote:

summary(brexit_data$Percent_Leave)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  4.085  47.145  54.266  52.989  60.347  75.562 
boxplot(brexit_data$Percent_Leave) #How does the boxplot relate to what you see above?

densityPlot(brexit_data$Percent_Leave) #from the car package

Recalling the commands from the previous script, we can also look at the distribution of the Leave vote for two groups of areas: those where the mean age is above 40 and those where it is below 40.

under40 <- subset(brexit_data, age_mean < 40)
over40 <- subset(brexit_data, age_mean > 40)

par(mfrow=c(1,2))
#"par" sets graphical parameters; "mfrow" arranges subsequent figures in an nr-by-nc array - in this case, a 1-row-by-2-column layout.

hist(under40$Percent_Leave)
hist(over40$Percent_Leave)

par(mfrow=c(1,1))
#here instead we fit our figures in a 1-row-by-1-column array
hist(under40$Percent_Leave)

hist(over40$Percent_Leave)

Now that you know how the Leave vote is distributed across the dataset, what would your best guess be if you had to give one likely value for a Leave percentage in any given constituency?

Remember that the Leave percentage varies across constituencies. There will always be errors, but we want to minimize the SSE - the sum of squared errors - of our prediction. In principle, the single value that is likely to be closest to the Leave vote share across all constituencies, and thus minimizes the sum of squared errors, is the mean!
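You can verify numerically that the mean minimizes the sum of squared errors; a quick sketch with made-up vote shares (names are ours):

```r
# Made-up Leave shares for five hypothetical areas
leave <- c(40, 48, 52, 55, 60)

# SSE when every area is predicted with a single constant value
sse_at <- function(guess) sum((leave - guess)^2)

sse_at(mean(leave))  # the mean gives the smallest possible SSE
sse_at(50)           # any other constant gives a larger SSE
sse_at(55)
```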

But, in reality, there is meaningful variation among units and some other factors might be useful in explaining the variation in the Leave vote. Would the mean age of a constituency explain the level of voting Leave? Health standards of the people living there? Their income level? The first step in regression analysis is to think of a plausible relationship between two variables and then unravel this relationship.


Bivariate Relationships

To understand overall relationships, scatterplots can be more effective than simply comparing histograms or maps.

Let’s plot a well-known correlation that came into play during the Brexit referendum: age. Do you see a relationship in the graph? We can add lines at the means of age and the Leave vote share to see how the observations are distributed with respect to these values.

par(mfrow=c(1,2))
plot(brexit_data$age_mean, brexit_data$Percent_Leave)

plot(brexit_data$age_mean, brexit_data$Percent_Leave)

abline(v=mean(brexit_data$age_mean, na.rm=TRUE), col = "red")
abline(h=mean(brexit_data$Percent_Leave[!is.na(brexit_data$age_mean)], na.rm=TRUE),
       # we do not want to include missing values
       col = "blue")

#"[!is.na(age_mean)], na.rm=T": tells R not to use missing values in age.
#missing values would keep R from calculating the mean and its relationship with the Leave variable.
par(mfrow=c(1,1))

Regression Analysis

The Logic of Regressions

Recall that the regression coefficient is calculated from the covariance of \(X\) and \(Y\). If the values of the observations in a sample differ a lot, the variance is high; if the values are relatively similar, the variance is low.

For instance, take the list of numbers 2, 9, 67. We calculate the variance by summing the squared difference between each number and the mean, and divide by the number of observations (minus one if the numbers are a sample rather than a population):

\[ s^2=\frac{\sum_{i=1}^n\left(x_i-\bar{x}\right)^2}{n-1} \]

numbers <- c(2,9,67)
#to create a vector containing the numbers 2, 9 and 67
numbers
[1]  2  9 67
num_mean <- mean(numbers)
#to compute the mean value for all the observations in the vector numbers
num_mean
[1] 26
num_var_sample <- 1/(3-1) * ((2-num_mean)^2 + (9-num_mean)^2 + (67-num_mean)^2)
#to compute the sample variance
num_var_sample
[1] 1273
num_var_pop <- 1/(3) * ((2-num_mean)^2 + (9-num_mean)^2 + (67-num_mean)^2)
#to compute the population variance
num_var_pop
[1] 848.6667
# More simply:
var(numbers) # note var() automatically makes a small-sample adjustment
[1] 1273

The variance is high (849 if the numbers represent the whole population, 1273 if they are a sample).

Recall that the variance represents the average squared distance between the observations and the mean. The standard deviation, accordingly, is the square root of the variance: roughly 36.

sd(numbers)
[1] 35.67913
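The link between var() and sd() can be checked directly:

```r
numbers <- c(2, 9, 67)

# sd() is simply the square root of the sample variance from var()
var(numbers)        # 1273
sqrt(var(numbers))  # 35.67913
sd(numbers)         # identical to the line above
```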

To summarise the relationship between two variables, we can calculate the correlation. Correlations are a standardised, unit-free way of measuring the relationship between two variables: they represent the degree to which one variable is associated with another.

cor(brexit_data$age_mean, brexit_data$Percent_Leave, use = "complete")
[1] 0.3825466
#the correlation is 0.38 on a scale from -1 to 1.
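Under the hood, the correlation is just the covariance rescaled by the two standard deviations. A sketch with made-up toy vectors (names are ours):

```r
# Toy vectors to illustrate the standardisation
toy_x <- c(1, 2, 3, 4, 5)
toy_y <- c(2, 1, 4, 3, 5)

# Correlation = covariance divided by the product of standard deviations
manual_cor <- cov(toy_x, toy_y) / (sd(toy_x) * sd(toy_y))
manual_cor
cor(toy_x, toy_y)  # matches the manual calculation
```

This is why the correlation is always bounded between -1 and 1, regardless of the units of the original variables.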

Since the correlation is positive, we know that as age increases so too does the percentage of Leave voters in the regional unit.

But how much does a one year increase in age add to the Leave vote in general terms? Let’s use regression to find out!

In linear regression analysis, we get the coefficient \(\beta\) by dividing the covariance of \(X\) and \(Y\) (the extent to which \(X\) and \(Y\) move together) by the variance of \(X\) (how much variation there is in \(X\), simply speaking). In other words: To what extent do the different values that \(X\) takes explain the relationship between \(X\) and \(Y\)?

\[ \hat{\beta}=\frac{\operatorname{COV}(X, Y)}{\operatorname{VAR}(X)} \]

To calculate the coefficient manually, we start by calculating the covariance of age and Leave voting:

age_leave_cov <- cov(brexit_data$Percent_Leave, brexit_data$age_mean, use = "complete")
#note the extra "use" argument at the end:
#missing observations are ignored; otherwise cov() would return NA.
age_leave_cov
[1] 10.96389

Next we can calculate the variance in our \(X\)-variable, i.e. age:

age_var <- var(brexit_data$age_mean, use = "complete")
age_var
[1] 8.515135

Finally, we can divide the covariance of \(X\) and \(Y\) by the variance of \(X\) to estimate the regression coefficient \(\beta\):

beta <- age_leave_cov/age_var
beta
[1] 1.287577

We see that - based on our linear estimation - a one year increase in the mean age of the electoral district is associated with an increase in the Leave vote share of the area by 1.288 %-points on average.


Linear Regression in R

Doing this calculation manually, however, is time-consuming. Fortunately, R can easily calculate the regression coefficients for us (including the intercept value, too, which is automatically added to the model):

mod1 <- lm(brexit_data$Percent_Leave ~ brexit_data$age_mean)
#note that the results are stored as an object in the working environment. Printing mod1 on its own is not the most informative way to interpret the regression - feel free to try it and compare it to the summary output below.

#Usually, we print the output using the summary command:
summary(mod1)

Call:
lm(formula = brexit_data$Percent_Leave ~ brexit_data$age_mean)

Residuals:
     Min       1Q   Median       3Q      Max 
-31.2068  -5.6726   0.5681   6.1777  22.2880 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)            2.6113     6.7841   0.385    0.701    
brexit_data$age_mean   1.2876     0.1682   7.657 1.97e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.088 on 342 degrees of freedom
  (38 observations deleted due to missingness)
Multiple R-squared:  0.1463,    Adjusted R-squared:  0.1438 
F-statistic: 58.63 on 1 and 342 DF,  p-value: 1.97e-13

You can see that the coefficient value for age_mean is the same as the manual value (\(\beta\)) we computed above.

Next, we might wonder how much of the variation in \(Y\) is explained by the variation in \(X\). We can find out by asking for the R-squared, which broadly indicates what portion of the variation is being explained by the regression model:

summary(mod1)$r.squared
[1] 0.1463419

We get an R-squared of 0.1463, which means that about 15 percent of the variation in the Leave vote is explained by age. The R-squared tells us how much better our model predicts the outcome than simply using the mean Leave vote for every constituency. Here, we conclude that age accounts for a meaningful part of the observed variation: constituencies with a higher mean age were more likely to vote Leave. In other words, knowing the mean age of a constituency gives us a better prediction of its vote than relying on the mean Leave vote alone.
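For a bivariate regression, the R-squared is simply the squared correlation between \(X\) and \(Y\). A quick check using R's built-in mtcars data (not the Brexit data; just an illustration):

```r
# Bivariate model on the built-in mtcars dataset
toy_mod <- lm(mpg ~ wt, data = mtcars)

summary(toy_mod)$r.squared    # R-squared from the model
cor(mtcars$mpg, mtcars$wt)^2  # the squared correlation: the same number
```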

Moreover, once we fit our estimated linear regression line, we can predict the Leave vote share for mean ages that are not included in our data.

plot(brexit_data$Percent_Leave ~ brexit_data$age_mean, xlim =c(30,50),
ylim =c(3,80), ylab = "Leave vote share", xlab = "Mean age", pch=16, col="blue")
abline(mod1)

# Recall why we specify the arguments in the command above: We want to create a figure that is as insightful, clear and parsimonious as possible!
#xlim = what are the limits of the x-axis? Mean ages are between 30 and 50.
#we can find this out by
min(brexit_data$age_mean, na.rm=TRUE)
[1] 30.9
#alternative code:min(brexit_data$age_mean[!is.na(brexit_data$age_mean)])
max(brexit_data$age_mean, na.rm=TRUE)
[1] 47.7
#alternative code:max(brexit_data$age_mean[!is.na(brexit_data$age_mean)])
#ylim = what are the limits of the y-axis?
min(brexit_data$Percent_Leave)
[1] 4.085381
max(brexit_data$Percent_Leave) #no missing values here
[1] 75.56243
#Leave vote share between 4 and 76
# xlab and ylab for naming the y and x axes
# pch and col just to make the dots prettier!
#if you just want to get a simple graph as a first idea, just run
plot(brexit_data$Percent_Leave ~ brexit_data$age_mean)
abline(mod1)

At the constituency mean age of 40.23808, we predict a 54 percent vote share of Leave. You can also calculate this manually without the graph by using the equation

\[Y=\alpha+\beta X\] Based on our simple bivariate model, our estimated value of the Leave vote will be

\[2.611 + (1.288*age_{mean}) = 2.611 + (40.23808*1.288) = 54.44 \] which is very close to what we got by looking at the graph.

Manual calculation:

#estimated % Leave share = 2.611 + 1.288 * mean age
mean(brexit_data$age_mean,na.rm = TRUE)
[1] 40.23808
beta # as we calculated before
[1] 1.287577
2.611+1.288*40.23808
[1] 54.43765
# Recall that you can call the estimated intercept and regression slope from the OLS model we saved earlier
mod1$coefficients[1]+mod1$coefficients[2]*mean(brexit_data$age_mean,na.rm = TRUE)
(Intercept) 
   54.42089 
#estimated share at another value of age:
mod1$coefficients[1]+mod1$coefficients[2]*45
(Intercept) 
   60.55223 
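Instead of plugging values into the coefficients by hand, predict() computes such estimates directly. A small sketch with made-up data (fitting the model with a data= argument lets newdata work cleanly):

```r
# Made-up constituency data: with the model fit via a data= argument,
# predict() returns an estimate for any new X value directly.
df  <- data.frame(age = c(32, 38, 41, 45), leave = c(44, 51, 55, 60))
fit <- lm(leave ~ age, data = df)
predict(fit, newdata = data.frame(age = c(40, 45)))
```

The same approach works for our Brexit model if it is refit as lm(Percent_Leave ~ age_mean, data = brexit_data) rather than with the $ notation used above.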

We now turn to a different question: Did immigration levels correlate with the Leave vote - and, if so, can we predict the Leave vote share based on the level of immigration in a given constituency?

Let’s try to answer this by regressing the Leave vote share on the percentage of UK-born residents in each constituency.

UKborn <- lm(brexit_data$Percent_Leave ~ brexit_data$Birth_UK_percent)
summary(UKborn) #to print out the result

Call:
lm(formula = brexit_data$Percent_Leave ~ brexit_data$Birth_UK_percent)

Residuals:
     Min       1Q   Median       3Q      Max 
-20.4253  -4.7247  -0.0025   4.4336  23.4417 

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)    
(Intercept)                   2.72722    3.63142   0.751    0.453    
brexit_data$Birth_UK_percent  0.58210    0.04062  14.331   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.775 on 342 degrees of freedom
  (38 observations deleted due to missingness)
Multiple R-squared:  0.3752,    Adjusted R-squared:  0.3734 
F-statistic: 205.4 on 1 and 342 DF,  p-value: < 2.2e-16

We get a coefficient of 0.58210 for Birth_UK_percent. This means that a one percentage-point increase in the proportion of UK-born residents is associated with an average 0.58 percentage-point rise in the Leave vote share. The more UK-born residents live in the electoral district, the higher the Leave vote!

Let’s also look at the correlation between these two variables:

cor(brexit_data$Percent_Leave, brexit_data$Birth_UK_percent, use="complete")
[1] 0.6125353
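As an aside, in a bivariate regression the R-squared is simply this correlation coefficient squared, which we can check against the model output above:

```r
r <- 0.6125353   # the correlation reported above
r^2              # ~0.3752, matching the Multiple R-squared of the UKborn model
```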

How do you interpret the regression coefficient? Is it a better predictor than age? Let’s plot this and include our line of best fit.

plot(brexit_data$Percent_Leave ~ brexit_data$Birth_UK_percent, pch=16, col="blue")
abline(UKborn)

Now, let’s try to look at this same relationship by creating a variable for the share of foreign-born residents in each constituency, before regressing the Leave vote on this newly created variable. [Can you think of another way to create this variable?]

brexit_data$foreignborn_percentage <- brexit_data$Birth_other_EU_percent +
brexit_data$Birth_Africa_percent +
brexit_data$Birth_MidEast_Asia_percent +
brexit_data$Birth_Americas_Carrib_percent +
brexit_data$Birth_Antarctica_Oceania_Other_percent
#foreignborn_percentage is the new variable
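One possible answer to the bracketed question: if the birthplace shares in the data are exhaustive, i.e. they sum to 100 (an assumption worth checking before relying on it), then the foreign-born share is just the complement of the UK-born share. A toy sketch with made-up shares:

```r
uk_born     <- c(90, 75, 60)                # made-up UK-born shares
foreign_sum <- c(5, 10, 20) + c(5, 15, 20)  # made-up regional shares added up
all.equal(foreign_sum, 100 - uk_born)       # TRUE when the shares are exhaustive
```

If that holds in brexit_data, 100 - brexit_data$Birth_UK_percent would give the same new variable.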

cor(brexit_data$foreignborn_percentage,
    brexit_data$Percent_Leave, use="complete") #this time negative
[1] -0.5973909
foreigners <- lm(brexit_data$Percent_Leave ~ brexit_data$foreignborn_percentage)
summary(foreigners)

Call:
lm(formula = brexit_data$Percent_Leave ~ brexit_data$foreignborn_percentage)

Residuals:
     Min       1Q   Median       3Q      Max 
-20.6996  -4.6138   0.0424   4.5006  23.9212 

Coefficients:
                                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)                         60.6691     0.6218   97.58   <2e-16 ***
brexit_data$foreignborn_percentage  -0.6199     0.0450  -13.78   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.888 on 342 degrees of freedom
  (38 observations deleted due to missingness)
Multiple R-squared:  0.3569,    Adjusted R-squared:  0.355 
F-statistic: 189.8 on 1 and 342 DF,  p-value: < 2.2e-16
#foreigners stores the results of this regression, to view just type "foreigners"

plot(brexit_data$Percent_Leave ~ brexit_data$foreignborn_percentage, pch=16, col="blue")
abline(foreigners)

A concern we may have at this point is whether our results are possibly driven by an outlier. We can get a better idea about what is influencing our estimates by asking R to label our observations.

plot(brexit_data$Percent_Leave ~ brexit_data$Birth_UK_percent, type="n")
text(brexit_data$Percent_Leave ~ brexit_data$Birth_UK_percent, labels=brexit_data$Region)

The labels are hard to read, but we can see that the areas at the lower end of both the Leave vote share and the share of UK-born residents are located in London. Let’s subset the data to exclude London and see if our estimates change.

#how many districts are in the London region?
length(brexit_data$Region[brexit_data$Region=="London"])
[1] 33
#we subset the data to exclude the 33 areas in London
withoutLondon <- subset(brexit_data, Region != "London")
dim(withoutLondon)
[1] 349  62
#The new number of observations is 349.
#This comes from subtracting London's 33 areas from the original total of 382.
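As an aside, subset() is equivalent here to plain bracket indexing, illustrated with a made-up data frame (the two differ only when the filtering variable contains NAs, which subset() silently drops):

```r
df <- data.frame(Region = c("London", "Wales", "London", "Scotland"), x = 1:4)
identical(subset(df, Region != "London"),
          df[df$Region != "London", ])  # TRUE here, with no NAs in Region
```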

#and now we re-run the regression
immigWL <- lm(withoutLondon$Percent_Leave ~ withoutLondon$Birth_UK_percent)
summary(immigWL) #what happened to our coefficient?

Call:
lm(formula = withoutLondon$Percent_Leave ~ withoutLondon$Birth_UK_percent)

Residuals:
     Min       1Q   Median       3Q      Max 
-20.9939  -4.4872  -0.0867   4.4734  22.9108 

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)    
(Intercept)                     9.16900    6.94355   1.321    0.188    
withoutLondon$Birth_UK_percent  0.51244    0.07576   6.764 6.74e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.375 on 309 degrees of freedom
  (38 observations deleted due to missingness)
Multiple R-squared:  0.129, Adjusted R-squared:  0.1261 
F-statistic: 45.75 on 1 and 309 DF,  p-value: 6.738e-11

What do you make of this estimate? How does it relate to your previous model? Provide a substantive interpretation.

Regression models often provide the most important insights in data analysis - it’s not by chance that they have become the gold standard of social science analysis. Let’s present our regression result in a nice way - we can use the stargazer package to create a decent-looking regression table.

#install.packages("stargazer") #note the quotation marks
library(stargazer)

fit_example <- lm(c(1,2,3,4,5,6) ~ c(4,9,8,1,2,3))
stargazer(fit_example, type="text")

===============================================
                        Dependent variable:    
                    ---------------------------
                        c(1, 2, 3, 4, 5, 6)    
-----------------------------------------------
c(4, 9, 8, 1, 2, 3)           -0.308           
                              (0.241)          
                                               
Constant                      4.888**          
                              (1.301)          
                                               
-----------------------------------------------
Observations                     6             
R2                             0.291           
Adjusted R2                    0.113           
Residual Std. Error       1.761 (df = 4)       
F Statistic              1.640 (df = 1; 4)     
===============================================
Note:               *p<0.1; **p<0.05; ***p<0.01
stargazer(mod1,immigWL, type="text")

===================================================================
                                  Dependent variable:              
                    -----------------------------------------------
                         Percent_Leave           Percent_Leave     
                              (1)                     (2)          
-------------------------------------------------------------------
age_mean                   1.288***                                
                            (0.168)                                
                                                                   
Birth_UK_percent                                   0.512***        
                                                    (0.076)        
                                                                   
Constant                     2.611                   9.169         
                            (6.784)                 (6.944)        
                                                                   
-------------------------------------------------------------------
Observations                  344                     311          
R2                           0.146                   0.129         
Adjusted R2                  0.144                   0.126         
Residual Std. Error    9.088 (df = 342)        7.375 (df = 309)    
F Statistic         58.629*** (df = 1; 342) 45.747*** (df = 1; 309)
===================================================================
Note:                                   *p<0.1; **p<0.05; ***p<0.01

Exercise

Immigration Attitudes: The Role of Economic and Cultural Threat

Why do the majority of voters in the U.S. and other developed countries oppose increasing immigration? According to the conventional view and many economic theories, people simply do not want to face additional competition on the labor market (the economic threat hypothesis). Nonetheless, most comprehensive empirical tests have failed to confirm this hypothesis and it appears that people actually often support policies that are not in their personal economic interest. At the same time, there has been growing evidence that immigration attitudes are rather influenced by various deep-rooted ethnic and cultural stereotypes (cultural threat hypothesis). Given the prominence of workers’ economic concerns in the political discourse, how can these findings be reconciled?

This exercise is based in part on Malhotra, N., Margalit, Y. and Mo, C.H., 2013. “Economic Explanations for Opposition to Immigration: Distinguishing between Prevalence and Conditional Impact.” American Journal of Political Science, Vol. 38, No. 3, pp. 393-433.

The authors argue that, while job competition is not a prevalent threat and therefore may not be detected by aggregating survey responses, its conditional impact in selected industries may be quite sizable. To test their hypothesis, they conduct a unique survey of Americans’ attitudes toward H-1B visas. A plurality of H-1B visas are held by Indian immigrants, who are skilled but ethnically distinct, which enables the authors to measure a specific skill set (high technology) that is threatened by a particular type of immigrant (H-1B visa holders). The data set immig.csv has the following variables:

Variable Name Variable Description
age Age (in years)
female Dummy: 1 indicates female; 0 indicates male
employed Dummy: 1 indicates employed; 0 indicates unemployed
nontech.whitcol Dummy: 1 indicates non-tech white-collar work (e.g. law), 0 everything else
tech.whitcol Dummy: 1 indicates high-technology work, 0 everything else
expl.prejud Explicit negative stereotypes about Indians (continuous scale, 0-1)
impl.prejud Implicit bias against Indian Americans (continuous scale, 0-1)
h1bvis.supp Support for increasing H-1B visas (5-point scale, 0-1)
indimm.supp Support for increasing Indian immigration (5-point scale, 0-1)

The main outcome of interest (h1bvis.supp) was measured with the following survey item: “Some people have proposed that the U.S. government increase the number of H-1B visas, which are allowances for U.S. companies to hire workers from foreign countries to work in highly skilled occupations (such as engineering, computer programming, and high-technology). Do you think the U.S. should increase, decrease, or keep about the same number of H-1B visas?” A higher value reflects greater support for increasing H-1B visas. Another outcome (indimm.supp) similarly asked about “the number of immigrants from India.” Both variables have the following response options: 0 = “decrease a great deal”, 0.25 = “decrease a little”, 0.5 = “keep about the same”, 0.75 = “increase a little”, 1 = “increase a great deal”.

To measure explicit stereotypes (expl.prejud), respondents were asked to evaluate Indians on a series of traits: capable, polite, hardworking, hygienic, and trustworthy. All responses were then used to create a scale lying between 0 (only positive perceptions of traits of Indians) to 1 (no positive perceptions of traits of Indians). Implicit bias (impl.prejud) is measured via the Implicit Association Test (IAT) which is an experimental method designed to gauge the strength of associations linking social categories (e.g., European vs Indian American) to evaluative anchors (e.g., good vs bad). Individuals who are prejudiced against Indians should be quicker at making classifications of faces and words when European American (Indian American) is paired with good (bad) than when European American (Indian American) is paired with bad (good). If you want, you can test yourself here.

Task 1

Start by examining the distribution of immigration attitudes (as factor variables). What is the proportion of people who are willing to increase the quota for high-skilled foreign professionals (h1bvis.supp) or support immigration from India (indimm.supp)?

Now compare the distribution of two distinct measures of cultural threat: explicit stereotyping about Indians (expl.prejud) and implicit bias against Indian Americans (impl.prejud). Create a scatter plot, add a linear regression line to it, and calculate the correlation coefficient. Based on these results, what can you say about their relationship?

#load data

immig <- read.csv("immig.csv")

#Descriptive Statistics
dim(immig)
[1] 1133   10
#table(immig$h1bvis.supp)
summary(immig$h1bvis.supp)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 0.0000  0.0000  0.2500  0.3471  0.5000  1.0000      11 
#table(immig$indimm.supp)
summary(immig$indimm.supp)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 0.0000  0.0000  0.5000  0.3523  0.5000  1.0000      11 
#table(immig$expl.prejud)
summary(immig$expl.prejud)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  0.000   0.250   0.350   0.355   0.450   1.000     238 
#table(immig$impl.prejud)
summary(immig$impl.prejud)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 0.0000  0.4723  0.5896  0.5743  0.6886  1.0000     124 
# Answer 1
prop.table(table(immig$h1bvis.supp))

         0       0.25        0.5       0.75          1 
0.30748663 0.22727273 0.29857398 0.10249554 0.06417112 
prop.table(table(immig$indimm.supp))

         0       0.25        0.5       0.75          1 
0.28787879 0.18092692 0.39839572 0.09982175 0.03297683 

About half of all voters would like to decrease the number of H-1B visas and Indian immigration (30+23=53% and 29+18=47% respectively) and about a third would like to maintain the status quo (30% and 40%). At the same time, only a minority of voters would like to see immigration increase (10+6=16% and 10+3=13%).

scatter.smooth(jitter(immig$expl.prejud, 3),
immig$impl.prejud,
xlab = "Explicit prejudice",
ylab = "Implicit prejudice")
# Because the independent variable is only observed at a few levels,
#it can be difficult to get a sense of the "cloud" of points.
#We can use jitter() to add a small amount of noise to a numeric
# vector (expl.prejud) to see the cloud more clearly.

fit1 <- lm(impl.prejud~ expl.prejud, data = immig)
abline(fit1, col = "red")
legend(x='topright', bty = "n", legend=paste('Cor =', round(cor(immig$expl.prejud, immig$impl.prejud, use="complete"), 5)))

The scatterplot shows that people of low or moderate explicit prejudice are more or less equally likely to have any level of implicit prejudice. However, it is instructive that almost none of the respondents of high explicit prejudice have low implicit prejudice. Overall, while the relationship between implicit and explicit prejudice is (slightly) positive (as can be expected), the correlation coefficient is very low (<0.1) which may indicate that these are in fact two distinct attitudinal constructs.

Task 2

To investigate the impact of cultural threat on policy attitudes, regress H-1B and Indian immigration attitudes on explicit and implicit prejudice (expl.prejud and impl.prejud), respectively, so that you have four different models in total. Create a scatter plot and add a linear regression line to it. Do you agree that cultural threat is an important predictor of immigration attitudes, as claimed in the literature?

If the labor market hypothesis is correct, opposition to H-1B visas should also be more pronounced among those who are economically threatened by this policy such as individuals in the high-technology sector. At the same time, tech workers should not be more or less opposed to general Indian immigration because of any specific economic considerations. First, regress both measures of immigration on the indicator variable for tech workers (tech.whitcol). Simply speaking, is there a relationship between the dependent variable (immigration attitudes) and our explanatory variable (tech workers)? How would you interpret the coefficients for tech.whitcol? Do the results support the hypothesis? Is the relationship different from the one involving cultural threat and, if so, how? What does the R-squared tell us about model fit?

# Excluding missing values
vars<-c("h1bvis.supp", "expl.prejud", "indimm.supp", "impl.prejud", "tech.whitcol")
# We construct a new dataframe in order not to do any changes in the original data
immig_nm<- immig[complete.cases(immig[vars]),]
# you might need to exclude missing values to use certain commands - don't worry though if you haven't done this in your script 
#2.1
# Regressing H-1B visa support on explicit negative stereotypes
m1<-lm(h1bvis.supp ~ expl.prejud,data=immig_nm)
summary(m1)

Call:
lm(formula = h1bvis.supp ~ expl.prejud, data = immig_nm)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.44031 -0.29476 -0.05799  0.16555  0.78463 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.44031    0.02211  19.912  < 2e-16 ***
expl.prejud -0.26464    0.05568  -4.752 2.34e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2965 on 893 degrees of freedom
Multiple R-squared:  0.02467,   Adjusted R-squared:  0.02358 
F-statistic: 22.59 on 1 and 893 DF,  p-value: 2.342e-06
par(mfrow=c(2,2)) # combines the plots into 2x2 'matrix'
scatter.smooth(jitter(immig_nm$expl.prejud, 3),
jitter(immig_nm$h1bvis.supp, 3),
xlab = "Explicit prejudice",
ylab = "H-1B Visa Support")
abline(m1, col="green")

#NOTE: You can create exactly the same scatter plot using the plot() command.
#      scatter.smooth allows you to fit non-linear lines and objects, 
#      which might turn out helpful further down the road.
#plot(jitter(immig_nm$expl.prejud, 3),
#jitter(immig_nm$h1bvis.supp, 3),
#xlab = "Explicit prejudice",
#ylab = "H-1B Visa Support")
#abline(m1, col="green")



# Regressing Indian Immigration support on explicit
#negative stereotypes
m2<-lm(indimm.supp~ expl.prejud,data=immig_nm)
summary(m2)

Call:
lm(formula = indimm.supp ~ expl.prejud, data = immig_nm)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.53281 -0.20898  0.03711  0.16406  0.76250 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.53281    0.01927   27.65   <2e-16 ***
expl.prejud -0.49218    0.04852  -10.14   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2584 on 893 degrees of freedom
Multiple R-squared:  0.1033,    Adjusted R-squared:  0.1023 
F-statistic: 102.9 on 1 and 893 DF,  p-value: < 2.2e-16
scatter.smooth(jitter(immig_nm$expl.prejud, 3),
jitter(immig_nm$indimm.supp, 3),
xlab = "Explicit prejudice",
ylab = "Indian Immigration Support")
abline(m2, col="green")


# Regressing H-1B visa support on Implicit stereotypes
m3<-lm(h1bvis.supp ~ impl.prejud,data=immig_nm)
summary(m3)

Call:
lm(formula = h1bvis.supp ~ impl.prejud, data = immig_nm)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.46800 -0.31194 -0.06019  0.17095  0.71449 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.46800    0.03706  12.629  < 2e-16 ***
impl.prejud -0.21045    0.06175  -3.408 0.000684 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2983 on 893 degrees of freedom
Multiple R-squared:  0.01284,   Adjusted R-squared:  0.01173 
F-statistic: 11.61 on 1 and 893 DF,  p-value: 0.0006835
scatter.smooth(jitter(immig_nm$impl.prejud, 3),
jitter(immig_nm$h1bvis.supp, 3),
xlab = "Implicit prejudice",
ylab = "H-1B Visa Support")
abline(m3, col="green")

#Regressing Indian Immigration support on Implicit stereotypes
m4<-lm(indimm.supp ~ impl.prejud,data=immig_nm)
summary(m4)

Call:
lm(formula = indimm.supp ~ impl.prejud, data = immig_nm)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.46677 -0.31203  0.09446  0.16341  0.67753 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.48459    0.03361  14.419   <2e-16 ***
impl.prejud -0.21884    0.05600  -3.908    1e-04 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2705 on 893 degrees of freedom
Multiple R-squared:  0.01681,   Adjusted R-squared:  0.01571 
F-statistic: 15.27 on 1 and 893 DF,  p-value: 0.0001002
scatter.smooth(jitter(immig_nm$impl.prejud, 3),
jitter(immig_nm$indimm.supp, 3),
xlab = "Implicit prejudice",
ylab = "Indian Immigration Support")
abline(m4, col="green")

dev.off() # closes the plotting device, resetting the par(mfrow=...) setting
null device 
          1 

Both measures of cultural threat are negatively related to both measures of immigration support. In other words, people who hold fewer positive stereotypes about Indians, or who show a stronger implicit bias against Indian Americans, are more likely to oppose an increase in H-1B visas and Indian immigration.

Model 1: starting from a constant of 0.4403, every unit increase in explicit prejudice is associated with a 0.26-point decrease in H-1B visa support. Given the linear regression equation \(Y=\alpha + \beta X\), a person with no explicit negative stereotypes about Indians (X = 0) would, on average, support an increase in H-1B visas at a level of 0.4403. Thus, even people with no prejudice prefer to keep the number of visas about the same. A person with no positive perceptions of Indians' traits (X = 1), on the other hand, supports an increase only at a level of 0.176 (0.4403 - (0.264 * 1)). In other words, they would prefer to decrease the number by more than just a little rather than increase it. (Since both variables are measured as proportions between 0 and 1, we can also think in percentages: 44% versus 18%.) The coefficient is highly statistically significant (***), so we can assume that this relationship is not random.
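The two predicted values in this interpretation can be recomputed directly from the reported coefficients:

```r
a <- 0.4403   # intercept reported for Model 1
b <- -0.2646  # slope reported for Model 1
a + b * 0     # ~0.44: predicted H-1B support at no explicit prejudice
a + b * 1     # ~0.18: predicted support at the maximum of the prejudice scale
```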

Note that the scatter.smooth() command fits a non-linear line: by default, the plot comes with a local polynomial (loess) regression that indicates the overall fit without assuming linearity. In substantive terms, the two lines are almost identical here. For low levels of explicit prejudice (no negative stereotypes), the relationship with visa support seems roughly constant; the slope becomes slightly negative at higher levels of prejudice. Our OLS line also indicates a negative relationship, but it diverges from the nonparametric regression line towards both tails. Possible reasons for this deviation are a) the small number of observations at those values of x and b) possible outliers that change the overall slope of the straight line.

Models 2-4 follow a similar logic. Comparing all four models, we can see that the negative relationship is particularly strong for the link between explicit stereotypes about Indians and attitudes toward Indian immigration.

Note that the loess regression line and the kernel smooth can change according to the bandwidth/span and the degree of local polynomial used.
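To see this, here is a sketch with simulated data showing how the span argument of scatter.smooth() (the fraction of points used for each local fit) changes the smoothed line:

```r
# Simulated data: a non-linear pattern plus noise.
set.seed(1)
x <- runif(100)
y <- sin(2 * pi * x) + rnorm(100, sd = 0.3)
scatter.smooth(x, y, span = 1/3, col = "grey")  # more local, wigglier fit
scatter.smooth(x, y, span = 2/3, col = "grey")  # the default, smoother fit
```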

#2.2.
m5<-lm(h1bvis.supp ~ tech.whitcol, data = immig)
summary(m5)

Call:
lm(formula = h1bvis.supp ~ tech.whitcol, data = immig)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.34995 -0.34995 -0.09995  0.15005  0.70339 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.349953   0.009211  37.994   <2e-16 ***
tech.whitcol -0.053343   0.040166  -1.328    0.184    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3003 on 1120 degrees of freedom
  (11 observations deleted due to missingness)
Multiple R-squared:  0.001572,  Adjusted R-squared:  0.0006808 
F-statistic: 1.764 on 1 and 1120 DF,  p-value: 0.1844
m6<-lm(indimm.supp ~ tech.whitcol, data = immig)
summary(m6)

Call:
lm(formula = indimm.supp ~ tech.whitcol, data = immig)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.3525 -0.3525  0.1475  0.1475  0.6525 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.352540   0.008460  41.674   <2e-16 ***
tech.whitcol -0.005082   0.036891  -0.138     0.89    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2758 on 1120 degrees of freedom
  (11 observations deleted due to missingness)
Multiple R-squared:  1.695e-05, Adjusted R-squared:  -0.0008759 
F-statistic: 0.01898 on 1 and 1120 DF,  p-value: 0.8904

Overall, the results may provide some cautious support for the labor market hypothesis. As expected, tech workers are slightly more opposed to H-1B visas (though the coefficient is not statistically significant), while they are not more or less opposed to Indian immigration in general, compared to those who are not tech workers. This pattern also contrasts with cultural threat, which is negatively related to both measures of immigration attitudes.